The first step in this process is generating a URL for each page of the forum. Each page has 10 posts, and as of the start of this project (5/1/2019) the forum has 502 pages. The first page’s URL is the website’s base URL followed by the forum ID number. All subsequent pages are numbered with a “-#” before the final forward-slash. I use a simple for loop to generate a vector of all the URLs in the forum. Next, I make an empty dataframe to hold the scraped information. For each post I gather the following: the user, the date, the time, and the text.
#generate a url for each page of the ideology and philosophy forum
ideo_philo_urls <- c("https://www.stormfront.org/forum/t451603/")
#generate a url for each page
for(i in 2:502){
ideo_philo_urls <- c(ideo_philo_urls, paste0("https://www.stormfront.org/forum/t451603-",
i,
"/"))
}
ideology_forum <- data.frame(user = c(),
date = c(),
time = c(),
text = c())
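The same vector of page URLs can also be built without a loop, because paste0 is vectorized. The following is only a sketch of that alternative; the names base_url, n_pages, and page_urls are introduced here for illustration and are not part of the original code.

```r
# Sketch: build the page URLs in one vectorized call instead of a loop.
# base_url, n_pages, and page_urls are illustrative names.
base_url <- "https://www.stormfront.org/forum/t451603"
n_pages <- 502  # page count taken from the loop bounds above

# Page 1 has no "-#" suffix; pages 2..n_pages do.
page_urls <- c(paste0(base_url, "/"),
               paste0(base_url, "-", 2:n_pages, "/"))

length(page_urls)
head(page_urls, 2)
```

Growing a vector inside a loop works fine at this size; the vectorized form mainly saves a few lines and keeps the page count in one place.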
Now that I have all the URLs of the forum pages in a vector and an empty dataframe to save the results in, I execute the following for loop to scrape all of the data from the forum and put it into a dataframe with the variables mentioned above. I scrape the data in 3 parts: the text itself, the date and time together, and then the usernames. The corresponding parts of the webpage scraping are labeled in the code below. I use the stringr package to extract the data that I want.
As of the current date of compilation, the last forum page does not have a full 10 comments, but the temporary comment-extraction object still has a length of 10 while the date, time, and user vectors have fewer than 10 elements. To keep this mismatch from causing an error and halting the knit of the document, I specify that the loop adds the new posts to the full dataframe on every iteration except the last one. I add the last set of posts to the dataframe separately.
for(i in 1:length(ideo_philo_urls)){
page <- read_html(url(ideo_philo_urls[i]))
#read the text from the posts
page_text_prelim <- page %>%
html_nodes("#posts .alt1") %>%
html_text()
#extract the text from the posts. Every other index in this vector is the post, with the remaining indices being missing.
page_text <- page_text_prelim[seq(1, 20, 2)]
page_date_time <- page %>%
html_nodes("#posts .thead:nth-child(1)") %>%
html_text()
page_date_time_prelim <- page_date_time %>%
data.frame() %>%
janitor::clean_names() %>%
mutate(date = stringr::str_extract(x,
"\\d{2}\\-\\d{2}\\-\\d{4}"),
time = stringr::str_extract(x,
"\\d{2}\\:\\d{2}\\s[A-Z]{2}")) %>%
filter(!is.na(date)) %>%
select(date,
time)
page_date <- as.vector(page_date_time_prelim$date)
page_time <- as.vector(page_date_time_prelim$time)
page_user_prelim <- page %>%
html_nodes("#posts .alt2") %>%
html_text() %>%
data.frame() %>%
janitor::clean_names() %>%
mutate(text = as.character(x),
user_time_detect = as.numeric(stringr::str_detect(text,
"Posts:")),
user = stringr::str_extract(text,
"([A-z0-9]+.)+")) %>%
filter(user_time_detect == 1) %>%
select(user)
page_user <- as.vector(page_user_prelim$user)
#as of 5/6/2019 this errors on the final loop because the last page only has 7 posts, so page_date, page_time, and page_user are shorter than page_text. The following if condition skips the rbind on the last loop to prevent the error.
if(i < length(ideo_philo_urls)){
page_df <- data.frame(user = as.character(page_user),
date = as.character(page_date),
time = as.character(page_time),
text = as.character(page_text))
ideology_forum <- rbind(ideology_forum, page_df)
}
}
#This deals with that last loop that failed
page_text <- as.vector(na.omit(page_text))
page_df <- data.frame(user = as.character(page_user),
date = as.character(page_date),
time = as.character(page_time),
text = as.character(page_text))
ideology_forum <- rbind(ideology_forum, page_df)
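The date and time extraction inside the loop can be tried in isolation on a few strings. The sketch below uses base R so that it runs without extra packages; the header strings are invented stand-ins (the real header text on the site may look different), and extract_first is a hypothetical helper, not a function from the original code.

```r
# Hypothetical header strings; only the date/time layout mirrors the patterns used in the loop.
headers <- c("\n\t05-01-2019, 03:15 PM\n", "#4021", "12-25-2008, 11:02 AM")

# Same patterns as in the loop: MM-DD-YYYY, and HH:MM followed by AM/PM.
date_pattern <- "\\d{2}-\\d{2}-\\d{4}"
time_pattern <- "\\d{2}:\\d{2}\\s[A-Z]{2}"

# Return the first match per element, or NA when there is none.
extract_first <- function(x, pattern) {
  hit <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[hit > 0] <- regmatches(x, hit)
  out
}

stamps <- data.frame(date = extract_first(headers, date_pattern),
                     time = extract_first(headers, time_pattern),
                     stringsAsFactors = FALSE)

# Drop elements that are not date headers, as the filter(!is.na(date)) step does.
stamps <- stamps[!is.na(stamps$date), ]
stamps
```

The helper plays the role that stringr::str_extract plays in the loop: one result per input element, with NA for elements that do not contain the pattern.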
The code below cleans the data captured above. There are several problems with irrelevant text in the posts. The first problem is that each post begins with the three word-like objects “Re: National Socialism” because that is the name of the forum. These three words are not relevant to discerning what is being discussed and are therefore removed. The second problem is that many of the posters quote each other and outside materials in their back and forth. In this project I am only interested in the novel components of each post. Thus, I remove all quoted text from each post. The third problem is line breaks and other control characters. Fourth, I remove all punctuation from the text for more succinct analysis.
The column “text_nore” is the post itself without the initial indicator that it is a response to the forum. Removing the text in this context is pretty straightforward because I only need to remove one phrase, and it does not appear elsewhere.
The column “text_noquote” is the text remaining from text_nore minus the text in quotes. This was a rather challenging piece of text to address. The example post below, shown in its raw form, illustrates just how tricky this part was to solve. The selected example has three quotes: the first names the user quoted, and the following two do not. The quotes have an inconsistent form, which makes it difficult to capture all possible quotes with one regex pattern. However, enough things are the same to make it work. First, all quotes start with the word “Quote:”, so I can easily identify the start of a quote. Second, all quotes end with two line breaks, “\n\n”. In between those two markers there are several words, control characters, and punctuation marks. To capture these patterns I use the greediest (and laziest) approach that works: match 0 or more of each pattern that may or may not exist within a quote until the two line breaks are matched at the end of the quote. This ultimately works on all quote types, and the final regex form can be seen below.
## [1] "\n\n\nRe: National Socialism\n\n\n\nQuote:\n\n\nOriginally Posted by Garak\n\n\nEver heard of copy and paste? I'll bet you could find it before I could. Give me a page number for quicker reference perhaps.\n\nSurely, in the same time it took you to post the question of what the difference between Socialism and National Socialism is, plus these other bickering posts, you could of \"copy and pasted\" all you wanted.\nQuote:\n\nHandouts? What the hell are you talking about. If asking you to clarify your position is a handout in your mind your little movement won't go anywhere.\n\nYou come into my thread with an interest in national socialism, but you don't bother to read the thread at all. Instead, you expect everyone else to compensate your laziness by digging through and finding it themselves for you, when we've already done our fair share of explaining it ourselves.Read the thread.\nI'm not even a National Socialist, and it annoys me that you would threaten to \"not support NS\" if we don't beckon to your will. NSers are our brothers, all the same. Unity makes us powerful, dissent breaks us.\nQuote:\n\nBTW, where are you in ND?\n\nWith how you've been acting, I'm not sure if I want to tell you.I don't want this thread to turn toward further argument unrelated to the topic.\n\n\n"
The problem of removing quotes posed another ‘unsolvable’ problem: some posts caused the mutate line that creates “text_noquote” to hang and never finish no matter how long it ran. I isolated the 8 posts causing this problem via a manual binary search. The only solution seems to be removing these posts, which is unfortunate. However, I have just over 5000 comments, so losing 8 of them is not THAT big of a deal.
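To make the quote structure concrete, here is a minimal sketch of a quote-removal pattern applied to a toy post modeled on the raw example above. The pattern is an assumption written for this illustration, not the regex used in the project, and the toy text and the name quote_pattern are invented. It assumes each quoted passage sits on a single line, which holds for the three quotes in the example but may not hold for every post.

```r
# Toy post with the same layout as the raw example (invented, shortened text).
post <- paste0("\n\n\nRe: National Socialism\n\n\n\n",
               "Quote:\n\n\nOriginally Posted by SomeUser\n\n\nFirst quoted line.\n\n",
               "First reply.\n",
               "Quote:\n\nSecond quoted line.\n\n",
               "Second reply.\n\n\n")

# Assumed pattern: "Quote:", line breaks, an optional attribution line,
# one line of quoted text, then the two line breaks that close the quote.
quote_pattern <- "Quote:\\n+(?:Originally Posted by [^\\n]*\\n+)?[^\\n]+\\n\\n"

# Drop the thread title first, then every quote.
text_nore <- sub("Re: National Socialism", "", post, fixed = TRUE)
text_noquote <- gsub(quote_pattern, "", text_nore, perl = TRUE)

trimws(text_noquote)
```

The pattern avoids repeating a group that itself contains a repetition, which is one common cause of regex matches that run for a very long time. Whether that was the cause of the hang described above is not known, so this is a suggestion to test rather than a diagnosis.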
The final two problems are trivial to solve. I remove all control characters with the “\c” regex pattern and store the result in the column “text_nobreak”. I remove all punctuation with the “[[:punct:]]” regex pattern. Therefore, the final column created in the code chunk below, text_nopunct, has the cleanest form.
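These last two steps can be sketched on a single string. The example below is an illustration in base R, not the project's code: it uses the POSIX class “[[:cntrl:]]” as a stand-in for the control-character pattern, and the sample sentence is invented.

```r
# Invented sample text with line breaks and punctuation.
text_noquote <- "\n\nI don't agree,\nbut it's a fair point!\n\n"

# Assumption: [[:cntrl:]] stands in for the control-character pattern.
# Control characters become spaces so that words on separate lines do not merge.
text_nobreak <- gsub("[[:cntrl:]]+", " ", text_noquote)

# Strip punctuation, then collapse repeated spaces and trim the ends.
text_nopunct <- gsub("[[:punct:]]", "", text_nobreak)
text_nopunct <- trimws(gsub(" +", " ", text_nopunct))

text_nopunct
```

Removing apostrophes this way turns contractions into single tokens such as “dont” and “its”, which is why tokens of that kind can show up in word counts computed from the cleaned text.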
This section provides some visualizations of who is posting, what they are saying, and how long their posts are. The first figure below shows some simple summary information about the top 50 posters. As is obvious from the graph, kayden is by far the most frequent poster, followed by Kaiserreich and John Hawkwood. There is a large discrepancy among the top three themselves, and between the top three and the rest of the posters. The top three are each separated by about 100 posts, and only 11 users have posted more than 100 times. Another interesting note from this figure is that the top posters are certainly not examples of ‘post frequently, but post little.’ The bars are shaded, with darker shades indicating more words per post (the number labels on the bars are length/10), and each 100+ poster has at least 60 words per post, indicating that they are contributing in some meaningful way to the debate in the forum.
The following figure shows the relationship, or lack thereof, between the number of posts and the length of the posts. Using the full dataset, the OLS line shows basically no relationship between frequency and length of posts, with an OLS coefficient of only 0.008. It looks as though the relationship is much stronger and positive for those underneath the threshold of 100 posts, but after subsetting the same plot (not shown here) I find that excluding the top posters does not make the relationship stronger.
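For readers who want to reproduce this kind of check, the slope of an OLS line can be read directly from lm. The sketch below runs on a small invented per-user table, not the forum data, and the object names are illustrative.

```r
# Invented per-user summary: number of posts and mean words per post.
users <- data.frame(n_posts = c(5, 20, 60, 150, 300),
                    mean_length = c(80, 95, 70, 90, 85))

# OLS of post length on post frequency, using all users.
fit_all <- lm(mean_length ~ n_posts, data = users)
slope_all <- unname(coef(fit_all)["n_posts"])

# Same model after excluding users with 100 or more posts.
fit_under_100 <- lm(mean_length ~ n_posts, data = subset(users, n_posts < 100))
slope_under_100 <- unname(coef(fit_under_100)["n_posts"])

c(all = slope_all, under_100 = slope_under_100)
```

Comparing the two slopes is the numeric counterpart of subsetting the plot: if excluding the top posters does not move the coefficient much, they are not what flattens the line.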
The next visualization that I want is a full timeline of the post history on the forum itself and a timeline of the activity of all users with over 100 posts. The code chunk below shows the setup of the dataframes for these plots. For the full timeline of the forum I simply group the scraped data by date and sum a variable that is equal to one for all posts, which gives me the number of posts per day. I then create a timeline dataframe of all possible days between the first day of activity and the last day of activity. I left join this timeline dataframe with the posts-per-day dataframe so that days on which there were no posts get a value of zero for all variables. This is necessary to ensure that I have all days, and am not excluding any because no posts exist on a given date.
Similarly, in this chunk I also create a dataframe with a user-month unit of analysis for the users with over 100 posts. I follow the same procedure as above, but I ultimately group the data by month instead of date. This is because the full timeline has a lot of zero-post days even when considering all users, so aggregating to monthly intervals is a good way to ensure that I still have within-year variation but with fewer zero values. It is also a more aesthetically pleasing and understandable plot.
#Create a user-month dataframe for all users
user_time <- cleaning %>%
separate(date,
into = c('m', 'd', 'y'),
sep = '-',
remove = F) %>%
mutate(date = as.Date(ISOdate(y, m, d)),
ym = paste0(y, "-", m)) %>%
group_by(user, ym) %>%
add_count(user,
name = "n_posts") %>%
summarise(mean_length = mean(length),
n_posts = mean(n_posts)) %>%
arrange(ym) %>%
ungroup()
#Isolate the users with more than 100 posts
user_100_time <- user_time %>%
mutate(user = as.character(user)) %>%
group_by(user) %>%
mutate(total_posts = sum(n_posts)) %>%
filter(total_posts >= 100)
#Create a vector long enough to set as a column in the monthly timeline.
top_users <- unique(user_100_time$user)
top_users_1507 <- c()
for(i in 1:137){
top_users_1507 <- c(top_users_1507, top_users)
}
#Create a posts per day dataframe from scraped data
posts_day <- cleaning %>%
separate(date,
into = c('m', 'd', 'y'),
sep = '-',
remove = F) %>%
mutate(date = as.Date(ISOdate(y, m, d)),
post = 1) %>%
group_by(date) %>%
summarise(mean_length = mean(length),
n_posts = sum(post),
n_users = n_distinct(user)) %>%
arrange(date) %>%
ungroup()
#Create full possible monthly timeline
timeline_monthly <- data.frame(year = 2008:2019) %>%
uncount(12) %>%
group_by(year) %>%
mutate(month = sprintf("%02d", seq_along(year)),
ym = paste0(year, "-", month)) %>%
filter(ym <= "2019-05") %>%
ungroup() %>%
select(ym) %>% #137 total months
uncount(length(top_users)) %>% #11 top users * 137 possible months = 1507 user-months
mutate(user = top_users_1507)
#Join the Top User dataframe with the full monthly timeline, and set missings to 0.
user_month <- left_join(timeline_monthly, user_time) %>%
mutate(n_month = replace_na(n_posts,
0),
mean_length = round(replace_na(mean_length,
0),
2),
mean_length10 = round(mean_length/10,
2)) %>%
group_by(user) %>%
mutate(n_total = sum(n_month)) %>%
ungroup() %>%
select(ym,
user,
n_month,
n_total,
mean_length10) %>%
arrange(user,
ym,
n_total) %>%
as.tbl()
#Create a full daily timeline
timeline_daily <- data.frame(date = seq.Date(min(posts_day$date),
max(posts_day$date),
"day"))
#Join the full daily timeline with the posts per day dataframe above and set missings to 0.
posts_daily <- left_join(timeline_daily, posts_day) %>%
separate(date,
into = c('y', 'm', 'd'),
sep = '-',
remove = F) %>%
mutate(number_posts = ifelse(!is.na(n_posts),
n_posts,
0),
number_users = ifelse(!is.na(n_users),
n_users,
0),
mean_length10 = ifelse(!is.na(mean_length),
mean_length/10,
0),
ym = paste0(y, "-", m) ) %>%
arrange(date)
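The zero-filling step is easy to check on a toy example. The sketch below uses base R's merge in place of left_join so that it runs without extra packages; the dates, counts, and object names (prefixed with toy_) are invented for illustration.

```r
# Invented posts-per-day table with two missing days (Jan 3 and Jan 4).
toy_posts_day <- data.frame(date = as.Date(c("2008-01-01", "2008-01-02", "2008-01-05")),
                            n_posts = c(3, 1, 2))

# Every calendar day between the first and last day of activity.
toy_timeline <- data.frame(date = seq(min(toy_posts_day$date),
                                      max(toy_posts_day$date),
                                      by = "day"))

# Keep every day of the timeline (all.x = TRUE); days without posts come back as NA.
toy_posts_daily <- merge(toy_timeline, toy_posts_day, by = "date", all.x = TRUE)

# Replace the NAs from unmatched days with zero posts.
toy_posts_daily$n_posts[is.na(toy_posts_daily$n_posts)] <- 0

toy_posts_daily
```

Joining from the timeline side is what guarantees one row per day; joining from the posts side would silently drop the empty days again.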
The Full Forum Post History plot below is a bar chart of the number of posts per day throughout the history of the forum’s activity. It is clear that the beginning of the forum is by far the most active. It has the most posts, and the most users per day (represented by darker columns). Another obvious conclusion is that about a year and a half (June 2011 - December 2012) has no activity. This is somewhat curious. It does not seem to line up with anything in the mass media that I can find. My first thought was that perhaps the website was shut down for a while, but I can find no evidence backing this up.
Notably, some of the most intense activity on the site is between Barack Obama’s formal entry into the presidential race and the 2008 presidential election. It is even more curious as a result that the last year and a half of Obama’s first term is empty.
The following scatterplot shows the relationship between the number of words per post, the number of posts per day, and the number of users per day. Generally speaking, the only days with long posts on average have few posts, and the only days with short posts on average are those with many posts, but this relationship is clearly nonlinear. The only days with many users also tend to have shorter but more frequent posts.
Looking at the post history of the top 11 users presents an interesting picture of who dominated the conversation in different contexts over the first 4 years of the data. None of the top posters are consistently active in the last half of the forum’s history, so I do not include the last half in this plot.
Kaiserreich, the user who started the forum, was very active originally but did not maintain his activity past the first year. Kayden, the top poster, is pretty consistently active throughout the period, with periodic increases in activity but generally showing a smooth trend. Many of the others are very isolated but intense in the frequency of their posts: John Hawkwood and KB Mansfield in particular have a few intense months of posting, but do not post outside of those windows. It seems likely that most of the top posters are top posters not because of their consistent dedication to discussion in the forum, but because of their individual dedication to particular discussions in the forum. Notably, John Hawkwood and Rob Jones peak at the same time at the end of 2008; perhaps they were arguing with each other?
Next, I break down the posts into two different formats for analysis: unigrams (one word phrases) and bigrams (two word phrases). This breaks each post into one observation per word-like-object in the post. A post that is originally one row and has 50 words in it becomes 50 rows, one per word (or two-word pair in the case of bigrams).
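A toy version of this reshaping, written in base R so that it runs on its own, is sketched below. The sentence and the object names are invented; in the project itself the tokenizing is done with unnest_tokens.

```r
# One invented "post" with 10 words.
post <- "the forum thread is long and the forum is old"

# Split on whitespace to get one word per element.
words <- strsplit(tolower(post), "\\s+")[[1]]

# One row per word.
toy_unigrams <- data.frame(id = 1, word = words, stringsAsFactors = FALSE)

# One row per pair of adjacent words: word i pasted to word i + 1.
toy_bigrams <- data.frame(id = 1,
                          bigram = paste(head(words, -1), tail(words, -1)),
                          stringsAsFactors = FALSE)

nrow(toy_unigrams)
nrow(toy_bigrams)
head(toy_bigrams$bigram, 3)
```

Note that a post with n words yields n unigram rows but n - 1 bigram rows, because the last word has no following word to pair with.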
The unigram object is created below, and to make the plot more readable I take only the top 50 most used words. I manually correct some words for variant forms and misspellings so that I do not have multiple observations per word. Some two-word phrases also need to be condensed into one-word tokens, such as “National Socialism” or “Mein Kampf.” The unigram framework treats these as two different words, but they certainly represent one concept together.
data("stop_words")
library(SnowballC)
unigram <- cleaning %>%
mutate(text_clean = str_replace_all(text_nopunct,
"[Nn]ational\\s[Ss]ocialism[A-z]*",
"ns") %>%
str_replace_all("[Nn]ational\\s[Ss]ocialist[A-z]*",
"ns") %>%
str_replace_all("[Aa]dolf(\\s)*[Hh]itler[A-z]*",
"hitler") %>%
str_replace_all("[Hh]itlers*",
"hitler") %>%
str_replace_all("[A-z]*[Mm]ein(\\s)*[Kk]ampf[A-z]*",
"meinkampf") %>%
str_replace_all("[Nn]ationalis[tm]",
"national") %>%
str_replace_all("[A-z]*([Tt]hird|3rd)(\\s)*[Rr]eich[A-z]*",
"3rd_reich") %>%
str_replace_all("[Nn]azi",
"nazi")) %>%
unnest_tokens(word,
text_clean) %>%
select(user,
time,
date,
word,
id) %>%
anti_join(stop_words)
unigram <- unigram %>%
mutate(word_stem = wordStem(word),
word_stem = ifelse(word == 'hitler' | word == 'ns' | word == 'meinkampf',
as.character(word),
word_stem))
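unigram_count <- unigram %>%
count(word, sort = T) %>%
#stem while word is still a character vector: reorder() turns it into a factor, and ifelse() on a factor returns integer codes instead of the words
mutate(word_stem = wordStem(word),
word_stem = ifelse(word == 'hitler' | word == 'ns' | word == 'meinkampf',
word,
word_stem),
word = reorder(word, n))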
The figure below shows the frequency of the 50 most commonly used words. Many of these results are unsurprising. The top word is “ns”, which is the token I recoded “national socialism” to. This is the name of the forum, so it is expected to be one of the most commonly used words. The fact that this expectation is met offers at least some confidence that the conversation within the forum remains on topic, rather than digressing into other things. Many of the remaining words have to do with race, such as “race”, “white”, “jewish”, “aryan”, and other similar references. This website is associated strongly with the Ku Klux Klan (KKK) and other far-right white-extremist racist ideologies. Much of the content of this thread is morally reprehensible and vitriolic in its discussion of race. Other common words reflect themes of Nazi Germany (e.g. “hitler”, “germany”, “german”), nationalism, war, and politics. Surprisingly, the word “nazi” itself is not in the top 50 words, but manually searching the data suggests this is because of the many forms it can take, such as “neo-nazi”, “nazism”, and other forms.
The following figure shows the stemmed counterpart of the unigram chart above. Many of the unigrams are similar, but stemming the words allowed many new words to rise to the top. I stem the words using the SnowballC package, which reduces each word to its stem by stripping suffixes. Now many words that refer to the same concept have the same stem, and are collapsed into a single bar in the chart. While it is difficult to draw much inference from these word frequency charts on their own, you can see that stemming the words increases the topic variety of the most used words. Note that “nazi” is now one of the top words but it was not before. The top 50 words are in a table in the appendix with their word stems next to them for curious readers.
#Tidytext does not have the option to remove stop words from sentences, so I use qdap for this.
library('qdap')
data("Top200Words")
#Remove stop words from sentences to make bigrams more meaningful and unnest into bigrams
bigram <- cleaning %>%
mutate(text_clean = str_replace_all(text_nopunct,
"[Nn]ational\\s[Ss]ocialism[A-z]*",
"ns") %>%
str_replace_all("[Nn]ational\\s[Ss]ocialist[A-z]*",
"ns") %>%
str_replace_all("[Aa]dolf(\\s)*[Hh]itler[A-z]*",
"hitler") %>%
str_replace_all("[Hh]itlers*",
"hitler") %>%
str_replace_all("[A-z]*[Mm]ein(\\s)*[Kk]ampf[A-z]*",
"meinkampf") %>%
str_replace_all("[Nn]ationalis[tm]",
"national") %>%
str_replace_all("[A-z]*([Tt]hird|3rd)(\\s)*[Rr]eich[A-z]*",
"3rd_reich") %>%
str_replace_all("[Nn]azi",
"nazi") %>%
rm_stopwords(Top200Words,
separate = FALSE)) %>%
unnest_tokens(bigram,
text_clean,
token = "ngrams",
n = 2) %>%
select(user,
time,
date,
bigram,
id)
#Use the tidytext way on the stop-word-removed data using the qdap approach. Similar Results.
bigram_split <- bigram %>%
separate(bigram, c("word1",
"word2"),
sep = " ",
remove = F) %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
filter(!is.na(bigram)) %>%
count(bigram, sort = T) %>%
mutate(bigram = reorder(bigram, n))
#Count the frequency of bigrams and order them by frequency.
bigram_sorted <- bigram %>%
count(bigram, sort = T) %>%
mutate(bigram = reorder(bigram, n))
The bigram plot below offers a better view of some of the topics discussed in the forum. The top bigrams are mostly related to Hitler, WWII Germany, nazism, and many forms of discussing National Socialism. They also discuss economic systems and many racial themes. Interestingly, some of the top bigrams are quite violent, such as “fight against” and “war against,” both of which may be advocating war to change current political structures. Reading the actual forum, it is relatively common to find posters advocating strong and radical political change through large scale wars, as they believe it is the only way national socialism can take hold in the US. By far the most common phrases have to do with different ways of depicting the white race; examples include “white race”, “white national(s)”, “white nation,” etc. If these were combined into one bar, that bar would be double the current top phrase of “3rd Reich.”
bigram_sorted %>%
na.omit(bigram) %>%
slice(1:50) %>%
ggplot(aes(y = n,
x = bigram)) +
geom_col() +
coord_flip() +
theme_bw() +
labs(title = "Most Common Unstemmed Bigrams",
x = "Bigram",
y = "Frequency") +
theme(axis.text = element_text(size = 20),
axis.title = element_text(size = 30),
title = element_text(size = 35),
plot.title = element_text(hjust = .5))
# Categorize into topics
unigram_nrc <- get_sentiments('nrc') %>%
filter(word != "white") %>% #white is neutral in this context
inner_join(unigram) %>%
add_count(sentiment,
sort = T) %>%
inner_join(user_rank)
unigram_nrc_user<- get_sentiments('nrc') %>%
filter(word != "white") %>% #white is neutral in this context
inner_join(unigram) %>%
group_by(user) %>%
add_count(sentiment,
sort = T) %>%
inner_join(user_rank)
# Binary positive or negative
unigram_bing <- inner_join(unigram, get_sentiments("bing")) %>%
add_count(sentiment,
name = "n_sent") %>%
add_count(sentiment,
index = id,
name = "n_sent_id") %>%
add_count(id,
name = "num_words") %>%
inner_join(user_rank)
# -5 to 5 negative to positive
unigram_afinn <- inner_join(unigram, get_sentiments("afinn")) %>%
group_by(id) %>%
separate(date,
into = c('m', 'd', 'y'),
sep = '-',
remove = F) %>%
mutate(net_score = sum(score),
date = as.Date(ISOdate(y, m, d))) %>%
select(user,
date,
id,
word,
score,
net_score) %>%
inner_join(user_rank)
unigram_nrc %>%
group_by(sentiment) %>%
summarise(n = mean(n)) %>%
ggplot(aes(x = sentiment,
y = n)) +
geom_col() +
labs(title = "Overall NRC Sentiment Distribution",
y = "Frequency",
x = "Sentiment") +
theme_bw() +
theme(axis.text = element_text(size = 20),
axis.text.x = element_text(angle = 45),
axis.title = element_text(size = 30),
title = element_text(size = 35),
plot.title = element_text(hjust = .5))
unigram_nrc_breakdown <- unigram_nrc %>%
count(word,
sentiment,
name = "n_words") %>%
ungroup() %>%
group_by(sentiment) %>%
top_n(10) %>%
mutate(word = reorder(word, n_words))
unigram_nrc_breakdown %>%
ggplot(aes(y = n_words,
x = word,
fill = sentiment)) +
geom_col() +
coord_flip() +
facet_wrap(sentiment ~ .,
ncol = 1,
scales = "free_y") +
labs(title = "Top Words per Sentiment",
y = "Frequency",
x = "Word") +
theme_bw() +
theme(axis.text = element_text(size = 30),
axis.title = element_text(size = 30),
title = element_text(size = 35),
plot.title = element_text(hjust = .5),
strip.text = element_text(size = 25),
legend.position = "none")
unigram_nrc_user %>%
group_by(user, sentiment) %>%
summarise(rank = mean(rank),
n = mean(n)) %>%
filter(rank <= 10) %>%
ggplot(aes(x = sentiment,
y = n,
fill = user)) +
geom_col() +
coord_flip() +
facet_wrap(user ~ .,
ncol = 5) +
labs(title = "NRC Sentiment Distribution among Top 10 Users",
y = "Frequency",
x = "Sentiment") +
theme_bw() +
theme(axis.text = element_text(size = 30),
axis.title = element_text(size = 30),
title = element_text(size = 35),
plot.title = element_text(hjust = .5),
strip.text = element_text(size = 25),
legend.position = "none")
unigram_bing %>%
count(sentiment,
name = "n") %>%
ggplot(aes(x = sentiment,
y = n)) +
geom_col() +
labs(title = "Overall Bing Sentiment Distribution",
y = "Frequency",
x = "Sentiment") +
theme_bw() +
theme(axis.text = element_text(size = 10),
axis.title = element_text(size = 15),
title = element_text(size = 15),
plot.title = element_text(hjust = .5))
The following plots use the “Bing” dictionary in the “tidytext” package to show the percent positive or percent negative within posts across the history of the forum. The first plot shows the posts aggregated by day, and the second plot shows the net positive-negative sentiment per post. Representing the forum by post has two advantages: it shows an alternative way to think of the forum’s flow, and it removes the time gap in the middle to make the plot less elongated. Thinking about the forum in terms of one post following another may be an optimal way to view it. Whoever the most recent poster is, and whenever they are posting, they can always see the most recent post even if it was made several years ago. The time gap where the forum goes quiet for over a year in the plots above makes the observations of actual posts even harder to see. This plot is still quite difficult to make sense of, but it is better than if there were a large gap from the missing days and months.
The net sentiment is the difference between the positive words in a post and the negative words in the post. For the sake of comparison I reduce the positive and negative sentiment of each post to a percentage, with the total number of sentiment-oriented words as the denominator. \(net\,sentiment =\frac{positive\,words}{total\,sentiment\,words} - \frac{negative\,words}{total\,sentiment\,words}\) I only include emotionally charged words in this analysis so that I can determine if a post is mostly negative or mostly positive inasmuch as it is either negative or positive. It would also be reasonable to consider the total non-stop-words as the denominator.
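As a worked example of the formula, a post with 3 positive words and 1 negative word has a net sentiment of 3/4 - 1/4 = 0.5. The short function below is a sketch of that calculation; net_sentiment is a name introduced here for illustration, and returning NA for posts with no sentiment words is an assumption about how to handle the zero denominator.

```r
# Net sentiment as defined above: share of positive minus share of negative
# sentiment words. net_sentiment is an illustrative helper name.
net_sentiment <- function(n_positive, n_negative) {
  total <- n_positive + n_negative
  # Assumption: posts without any sentiment words get NA rather than 0.
  ifelse(total == 0, NA_real_, (n_positive - n_negative) / total)
}

# Three invented posts: mostly positive, mostly negative, no sentiment words.
net_sentiment(c(3, 1, 0), c(1, 3, 0))
```

With this definition the score always lies between -1 (only negative words) and 1 (only positive words), which is what makes posts of different lengths comparable.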
The plots using the Bing dictionary seem to suggest that most days, and indeed most posts, are characterized by being mostly negative, or mostly positive.
## [1] 41159
## [1] 4908 41159
## [1] 278531
## [1] 5010 278531
Next I estimate several Latent Dirichlet allocation (LDA) models for each of the unigram and bigram dataframes created and visualized above. LDA models are an unsupervised machine learning tool for identifying topics within text based on the frequency with which words appear together and separately. The premise is that all bodies of text are made up of topics, and all topics are made up of a particular mixture of words; LDA tries to identify topics based on the distribution of words within and between documents.
I vary the models below by changing the parameter ‘k.’ This is the number of topics that define the multinomial distribution the LDA model uses while estimating. The larger the k, the more topics the LDA will generate, and thus the more potential nuance uncovered in the discussion. However, in the context of these posts it would be easy to overfit the corpus of text because the posts are all different lengths, some much longer than others. Shorter posts have less opportunity to discuss any topic than longer posts. In these preliminary analyses I ignore these concerns, but keep them in mind for a secondary step of analysis. I estimate models with k values of 2, 3, 4, 5, and 10 for both the unigram data and the bigram data.
library(tictoc)
tic()
unigram_lda2 <- LDA(unigram_dtm,
k = 2,
control = list(seed = 7117))
toc()
## 32.51 sec elapsed
tic()
unigram_lda3 <- LDA(unigram_dtm,
k = 3,
control = list(seed = 7117))
toc()
## 62.08 sec elapsed
tic()
unigram_lda4 <- LDA(unigram_dtm,
k = 4,
control = list(seed = 7117))
toc()
## 97.5 sec elapsed
tic()
unigram_lda5 <- LDA(unigram_dtm,
k = 5,
control = list(seed = 7117))
toc()
## 126.39 sec elapsed
tic()
unigram_lda10 <- LDA(unigram_dtm,
k = 10,
control = list(seed = 7117))
toc()
## 332.08 sec elapsed
set.seed(7117)
tic()
bigram_lda2 <- LDA(bigram_dtm,
k = 2,
control = list(seed = 7117))
toc()
## 15.17 sec elapsed
tic()
bigram_lda3 <- LDA(bigram_dtm,
k = 3,
control = list(seed = 7117))
toc()
## 24.45 sec elapsed
tic()
bigram_lda4 <- LDA(bigram_dtm,
k = 4,
control = list(seed = 7117))
toc()
## 36.6 sec elapsed
tic()
bigram_lda5 <- LDA(bigram_dtm,
k = 5,
control = list(seed = 7117))
toc()
## 48.33 sec elapsed
tic()
bigram_lda10 <- LDA(bigram_dtm,
k = 10,
control = list(seed = 7117))
toc()
## 120.57 sec elapsed
The two topic LDA model is represented visually below. The first plot shows the top 30 unigrams in each topic and their beta values. The beta value is the probability of a word being generated by a given topic. As an example, “ns” has a beta of about 0.024 in topic 1, but only about 0.005 in topic 2. With only two topics it is difficult to determine what the topics represent. It is noteworthy that the top words of each overlap quite a bit. Overlapping words include: germany, jews, jewish, world, race, ns, political, government, and others. It is plausible that topic one refers more to the present state of the NS movement, while topic two is more focused on the state of the NS movement during the times of Nazi Germany and WWII. Or, topic two may be more defined by the German experience with NS, because some of its top words, such as “der” and “die”, are German articles, indicating they were paired with a German noun and referenced something from Germany. Another possible labeling is that topic one is more associated with the ideological roots of national socialism and discussion of those who are its primary enemies, while topic two seems to have more words discussing economic conditions and ideology.
u_lda2_topics <- tidytext::tidy(unigram_lda2)
u_lda2_topics %>%
group_by(topic) %>%
top_n(30, beta) %>%
ungroup() %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free_y") +
coord_flip() +
labs(title = "Top 30 Unigrams per Topic",
x = "Unigram",
y = "Beta") +
theme(axis.title = element_text(size = 35),
title = element_text(size = 40),
plot.title = element_text(hjust = .5),
axis.text = element_text(size = 30),
strip.text = element_text(size = 30))
The second plot, which compares the betas of the two topics, illuminates the story of the two topic LDA a little more. The words most associated with topic two (values greater than zero) are indeed more associated with wartime Germany themes. Topic two is certainly more focused on economic concepts, with words like capitalism, money, economics, demand, workers, and others being far more associated with topic two than topic one. Topic one, on the other hand, does seem much more ideologically focused, specifically on white christian topics that the posters believe are associated with Germany.
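The quantity plotted is the base-2 log of the ratio of a word's beta in topic two to its beta in topic one. The sketch below computes it for three made-up rows; the beta values are illustrative only and the data frame toy_betas is not an object from the analysis.

```r
# Illustrative betas for three terms (not estimates from the fitted model).
toy_betas <- data.frame(term = c("ns", "money", "race"),
                        topic1 = c(0.024, 0.002, 0.010),
                        topic2 = c(0.005, 0.008, 0.010),
                        stringsAsFactors = FALSE)

# Keep terms that are reasonably common in at least one topic.
toy_betas <- toy_betas[toy_betas$topic1 > 0.001 | toy_betas$topic2 > 0.001, ]

# Positive values lean toward topic 2, negative values toward topic 1.
toy_betas$log_ratio <- log2(toy_betas$topic2 / toy_betas$topic1)

toy_betas[order(toy_betas$log_ratio), ]
```

On this scale a value of 2 means the word is four times as probable under topic two, a value of 0 means it is equally probable under both, and the sign tells you which topic the word leans toward.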
u_lda2_topics %>%
mutate(topic = paste0("topic", topic)) %>%
spread(topic, beta) %>%
filter(topic1 > .001 | topic2 > .001) %>%
mutate(frac = topic2 / topic1,
log_ratio = log2(frac)) %>%
top_n(40, abs(log_ratio)) %>%
ungroup() %>%
mutate(term = reorder(term, log_ratio)) %>%
ggplot(aes(y = log_ratio,
x = term)) +
geom_col() +
coord_flip() +
labs(title = "Log2 Ratio of Betas Topic 2 / Topic 1",
x = "Unigram",
y = "Log2(Beta2/Beta1)") +
theme(axis.title = element_text(size = 35),
title = element_text(size = 40),
plot.title = element_text(hjust = .5),
axis.text = element_text(size = 30))
knitr::kable(unigram_count[1:50,] %>% select(-n))
| word | word_stem |
|---|---|
| ns | ns |
| people | peopl |
| hitler | hitler |
| white | white |
| german | german |
| national | nation |
| race | race |
| germany | germani |
| world | world |
| dont | dont |
| time | time |
| government | govern |
| socialism | social |
| system | system |
| war | war |
| political | polit |
| jews | jew |
| nation | nation |
| racial | racial |
| economic | econom |
| thread | thread |
| power | power |
| party | parti |
| jewish | jewish |
| movement | movement |
| life | life |
| true | true |
| aryan | aryan |
| capitalism | capit |
| read | read |
| society | societi |
| jew | jew |
| social | social |
| im | im |
| history | histori |
| means | mean |
| america | america |
| american | american |
| idea | idea |
| socialist | socialist |
| post | post |
| nazi | nazi |
| free | free |
| agree | agre |
| money | monei |
| country | countri |
| real | real |
| understand | understand |
| nature | natur |
| simply | simpli |